In this project we’ll be exploring some chemical ingredients of red wine, and how does these affect the quality of our wine.
First of all let’s check out what are these ingredients, by looking at the column names and their types.
names(wine)
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
str(wine)
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
Let’s now look at how many wines of each quality do we have, we can notice that the most quality we have in our data is quality 5 with 681 wines, and the second is quality 6 with 638.
Then we have 199 wines of quality 7, 53 of 4, 10 of 3 and 18 of 8.
So basically most of our wines lie between qality 5 and 6.
ggplot(aes(x=quality), data=wine)+
geom_histogram(binwidth=.1, fill='blue')+
scale_x_continuous(breaks=c(3:8))+
scale_y_continuous(breaks=seq(0, 700, 50))+
ggtitle('Wine Quality Counts')
I would like to start by exploring what are the relationships between the variables in my data. Best thing to do that is to explore our data using ggpairs function in the GGally library.
ggpairs(wine, binwidth=1)+
theme(panel.border =
element_rect(linetype = "dashed", colour = "blue", fill = NA))
Looking at the output generated by ggpairs function, we can notice that the wine quality is somewhat positively related to the amount of alcohol, that’s why looking at the scatterplot for variables alcohol and quality we can see the points moving a bit to the right (higher alcohol percentage) as we move up towards a better quality.
Another thing we notice is that quality is somewhat negatively related to volatile.acidity that’s why the points move a bit left whenever the quality goes higher.
Other than these 2 variables nothing seems to be really affecting quality in our variables, or at least not as much as alcohol and volatile acidity.
We’ll have to explore more…
Before continuing, I want to create a new column in our data set, which is the same as the quality column but as a factor.
wine$quality.factor=factor(wine$quality)
ggplot(aes(x=round(alcohol, digits=1)), data=wine)+
geom_bar(binwidth=.1, fill='blue')+
scale_x_continuous(limits=c(9, 13.7), breaks=seq(9, 13.7, .1))+
scale_y_continuous(limits=c(0, 140), breaks=seq(0, 140, 10))+
xlab('Alcohol round to 1 digit after the decimal')+
ggtitle('Wine by Alcohol Counts')
## Warning: position_stack requires constant width: output may be incorrect
An interesting thing that I would like to do, is look at how our wines are separated by pH. Let’s first create groups of pH, in intervals of .5 and check how much each group will contain.
wine$pH.groups=cut(wine$pH, breaks=c(2.5, 3, 3.5, 4, 4.5))
ggplot(aes(x=pH.groups), data=wine)+
geom_histogram(fill='blue', width=1)+
scale_y_continuous(breaks=seq(0, 1400, 50))
We noticed before that the quality is mostly affected by the amount of alcohol in the wine, and volatile acidity. Let’s explore these 2 variables in a plot.
ggplot(aes(x=alcohol, y=quality, color=volatile.acidity),
data=wine)+
geom_point(size=4, position="jitter")+
scale_color_gradient(low="#1d0000", name="Volatile Acidity", high="#ff6161")+
ggtitle('Quality vs. Alcohol colored by Volatile Acidity')
ggplot(aes(x=citric.acid, y=fixed.acidity,
color=volatile.acidity, size=pH), data=wine)+
geom_jitter()+
scale_y_continuous(limits=c(5,12), breaks=seq(5,12,1))+
scale_x_continuous(limits=c(0, 0.75))+
scale_color_gradient(low="#1d0000", name="Volatile Acidity", high="#ff6161")+
scale_size(name="pH", range = c(2, 8))+
ggtitle('Fixed Acidity vs. Citric Acid
Colored by Volatile Acidity Sized by pH')
## Warning: Removed 134 rows containing missing values (geom_point).
ggplot(aes(x=fixed.acidity, y=density, color=citric.acid), data=wine)+
geom_jitter(size=3)+
geom_smooth(method='loess')+
scale_color_gradient(low="#1d0000", name="Citric Acid", high="#ff6161")+
ggtitle('Density vs. Fixed Acidity')
On the other hand, again looking at the ggpairs output, we can notice that desnity is negatively related to alcohol, so the higher the alcohol the lower the density. Let’s look into that.
ggplot(aes(x=alcohol, y=density), data=wine)+
geom_line(stat='summary', fun.y=median, color='blue')+
geom_smooth(method="loess")+
ggtitle('Density vs. Alcohol')
p1=ggplot(aes(x=alcohol, y=density), data=wine)+
geom_line(stat='summary', fun.y=median, color='blue')+
ggtitle('Density vs. Alcohol (blue) - Density vs. Fixed Acidity (red)')
p2=ggplot(aes(x=fixed.acidity, y=density), data=wine)+
geom_line(color='red', stat='summary', fun.y=median)
grid.arrange(p1, p2, ncol=1)
ggplot(aes(x=residual.sugar, y=density), data=subset(wine, residual.sugar<=4))+
geom_point()+
scale_x_continuous(breaks=seq(0, 4, .1))+
geom_smooth(method='loess')+
ggtitle('Density vs. Residual Sugar')
One last thing I would like to look at is free sulfur dioxide against total sulfur dioxide. According to this article sulfur dioxide is an important ingredient that existed since the beggining of time of wine. So let’s take a look into that.
ggplot(aes(x=free.sulfur.dioxide, y=total.sulfur.dioxide,
color=quality.factor),
data=subset(wine, free.sulfur.dioxide<=55 & total.sulfur.dioxide<=150))+
geom_point()+
scale_color_manual(values=rev(brewer.pal(9, "Blues")), name="Quality")+
scale_size_discrete(name="Quality")+
ggtitle('Total Sulfur Dioxide vs. Free Sulfur Dioxide,
Colored and Sized by Quality')
I was wondering why are volatile acidity and citric acid negative correlated, so I did some research online to understand. I ended up understading that a wine with a low acidity is considered as boring, and one is very high acidity will lead to a sour wine. That’s why wine makers have to know how to balance the acidity of the wine by balancing these 2 acids, volatile and citric.
Article link
Here’s a plot to show us the relation between citric and volatile acidity:
ggplot(aes(x=volatile.acidity, y=citric.acid,
color=quality.factor),
data=subset(wine, volatile.acidity>=.2 & volatile.acidity<=1.2))+
geom_point(stat='summary', fun.y=median)+
scale_color_manual(values=rev(brewer.pal(9, "Blues")), name="Quality")+
scale_size_discrete(name="Quality")+
scale_x_continuous(breaks=seq(0.2, 1.2, .2))+
ggtitle('Volatile Acidity vs. Citric Acid, colored and sized by Quality')
Let’s see if we can prove this, by relooking at the plot we did above for our pH.groups but this time, let’s add color by quality to it, let’s see where do most higher quality wines fall.
ggplot(aes(x=pH, fill=quality.factor, color=quality.factor), data=wine)+
geom_density(aes(alpha=.7))+
scale_x_continuous(breaks=seq(2.7, 4, .1))+
scale_y_continuous(breaks=seq(0, 1400, 50))
Let’s look in another way on how alcohol affects the quality of wines.
wine_by_alcohol=wine %>% group_by(alcohol) %>%
summarize(median_quality=median(quality))%>%arrange(median_quality)
ggplot(aes(x=alcohol, y=median_quality), data=wine_by_alcohol)+
geom_point()+
geom_smooth(method='loess')+
ggtitle('Median Quality vs. Alcohol')
Now let’s take a closer look at volatile acidity alone, see what we can deduce.
ggplot(aes(x=volatile.acidity , y=quality), data=wine)+
geom_point(stat="summary", fun.y=median)+
geom_smooth(method='loess')+
ggtitle('Median Quality vs. Volatile acidity')
Our first model will be the linear regression model, then we’ll plot our predictions against the actual quality of each wine:
Let’s first define 3 functions that will help us test the accuracy of our model.
First function “prediction_testing_and_group” will take a dataframe as a parameter, in this dataframe it will add a column quality.prediction.equal which will be True if our prediction is correct, and false if not. Then this function will create a new dataframe which is created by grouping the provided dataframe by quality, quality.prediction, and quality.prediction.equal, and an addition column named counts which is the count of items in this group.
Second function will plot False and True predcitions and size them by their counts.
Third function will calculate accuracy.
prediction_test_and_group=function(df){
df$quality.prediction.equal=
ifelse(df$quality==
df$quality.prediction, TRUE, FALSE)
df$quality.prediction.equal=
factor(df$quality.prediction.equal)
return_df=df%>%group_by(quality.factor, quality.prediction,
quality.prediction.equal)%>%summarize(counts=n())
}
plot_predictions=function(df, title){
ggplot(aes(x=quality.factor, y=quality.prediction,
color=quality.prediction.equal, size=counts),
data=df)+
geom_point()+
ylab('predicted quality')+
ggtitle(title)
}
calculate_accuracy=function(df, total_count){
(sum(
subset(df, quality.prediction.equal==TRUE)$counts)
/total_count)*100
}
Now the first model:
wine$pH.factor=factor(wine$pH)
m1=lm(quality ~ alcohol, data=wine)
m2=update(m1, ~ . + volatile.acidity)
m3=update(m2, ~ . + sulphates)
m4=update(m3, ~ . + citric.acid)
m5=update(m4, ~ . + total.sulfur.dioxide)
m6=update(m5, ~ . + density)
m7=update(m6, ~ . + chlorides)
m8=update(m7, ~ . + fixed.acidity)
m9=update(m8, ~ . + pH)
mtable(m1, m2, m3, m4, m5, m7, m8, m9)
##
## Calls:
## m1: lm(formula = quality ~ alcohol, data = wine)
## m2: lm(formula = quality ~ alcohol + volatile.acidity, data = wine)
## m3: lm(formula = quality ~ alcohol + volatile.acidity + sulphates,
## data = wine)
## m4: lm(formula = quality ~ alcohol + volatile.acidity + sulphates +
## citric.acid, data = wine)
## m5: lm(formula = quality ~ alcohol + volatile.acidity + sulphates +
## citric.acid + total.sulfur.dioxide, data = wine)
## m7: lm(formula = quality ~ alcohol + volatile.acidity + sulphates +
## citric.acid + total.sulfur.dioxide + density + chlorides,
## data = wine)
## m8: lm(formula = quality ~ alcohol + volatile.acidity + sulphates +
## citric.acid + total.sulfur.dioxide + density + chlorides +
## fixed.acidity, data = wine)
## m9: lm(formula = quality ~ alcohol + volatile.acidity + sulphates +
## citric.acid + total.sulfur.dioxide + density + chlorides +
## fixed.acidity + pH, data = wine)
##
## =============================================================================================================
## m1 m2 m3 m4 m5 m7 m8 m9
## -------------------------------------------------------------------------------------------------------------
## (Intercept) 1.875*** 3.095*** 2.611*** 2.646*** 2.843*** -0.953 28.165 8.571
## (0.175) (0.184) (0.196) (0.201) (0.205) (11.990) (15.083) (17.045)
## alcohol 0.361*** 0.314*** 0.309*** 0.309*** 0.295*** 0.280*** 0.268*** 0.293***
## (0.017) (0.016) (0.016) (0.016) (0.016) (0.020) (0.021) (0.023)
## volatile.acidity -1.384*** -1.221*** -1.265*** -1.222*** -1.114*** -1.137*** -1.131***
## (0.095) (0.097) (0.113) (0.112) (0.120) (0.120) (0.119)
## sulphates 0.679*** 0.696*** 0.721*** 0.903*** 0.916*** 0.898***
## (0.101) (0.103) (0.103) (0.112) (0.112) (0.112)
## citric.acid -0.079 -0.043 0.044 -0.198 -0.224
## (0.104) (0.104) (0.124) (0.145) (0.145)
## total.sulfur.dioxide -0.002*** -0.002*** -0.002*** -0.002***
## (0.001) (0.001) (0.001) (0.001)
## density 3.923 -25.583 -4.343
## (11.944) (15.122) (17.404)
## chlorides -1.747*** -1.584*** -1.827***
## (0.406) (0.408) (0.419)
## fixed.acidity 0.055** 0.016
## (0.017) (0.023)
## pH -0.442*
## (0.180)
## -------------------------------------------------------------------------------------------------------------
## R-squared 0.227 0.317 0.336 0.336 0.344 0.352 0.356 0.358
## adj. R-squared 0.226 0.316 0.335 0.334 0.342 0.349 0.353 0.355
## sigma 0.710 0.668 0.659 0.659 0.655 0.652 0.650 0.649
## F 468.267 370.379 268.912 201.777 166.962 123.298 109.751 98.534
## p 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -1721.057 -1621.814 -1599.384 -1599.093 -1589.749 -1580.138 -1575.112 -1572.088
## Deviance 805.870 711.796 692.105 691.852 683.814 675.643 671.409 668.874
## AIC 3448.114 3251.628 3208.768 3210.186 3193.499 3178.276 3170.224 3166.177
## BIC 3464.245 3273.136 3235.654 3242.448 3231.138 3226.670 3223.995 3225.325
## N 1599 1599 1599 1599 1599 1599 1599 1599
## =============================================================================================================
wine$quality.prediction=round(fitted(m9))
wine_by_quality_and_pred=prediction_test_and_group(wine)
plot_predictions(wine_by_quality_and_pred,
"Predicted Quality (Linear Regression Model) vs.
Actual Quality")
calculate_accuracy(wine_by_quality_and_pred, length(wine$quality.prediction))
## [1] 59.28705
As we can see above, we created a linear regression model, then created a new variable in our data frame, quality.prediction, and quality.prediction.equal which is True if the predicted quality is equal to the actual quality, and false if not. We then create a new dataframe, by grouping our data on quality, quality.prediction and quality.prediction.equal, then we summarize to get the count of data in each of these groups. Then we plot these.Let’s try another model, and see how well can we do. This model will be the SVM model (Support Vector Machine). for this model, I’ll change quality to factor too, I want my model to be a classification model because actually quality isn’t an integer but a classification. Let’s see what we get:
#Let's first drop the columns that we don't need anymore.
wine$pH.groups=NULL
wine$quality.prediction=NULL
wine$quality.prediction.equal=NULL
wine$pH.factor=NULL
model = svm(quality.factor~., data = wine)
wine$quality.prediction=fitted(model)
wine_by_quality_and_pred=prediction_test_and_group(wine)
summary(model)
##
## Call:
## svm(formula = quality.factor ~ ., data = wine)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
## gamma: 0.08333333
##
## Number of Support Vectors: 482
##
## ( 169 162 87 36 18 10 )
##
##
## Number of Classes: 6
##
## Levels:
## 3 4 5 6 7 8
plot_predictions(wine_by_quality_and_pred,
"Predicted Quality (SVM) vs. Actual Quality")
calculate_accuracy(wine_by_quality_and_pred, length(wine$quality.prediction))
## [1] 99.49969
Doing the same thing for our SVM as we did for our regression model. We can see a VERY big difference in our predictions. 99.5% accuracy! Looking at the graph we can barely see a red point indicating few wrong predictions.
Let’s try playing a bit with our features with respect to the stuff we studied above, let’s see if we can increase our model’s score. Here we go:
feature_cols=c('alcohol', 'volatile.acidity', 'citric.acid',
'total.sulfur.dioxide', 'density', 'quality.factor', 'quality')
wine_for_svm=wine[,(names(wine) %in% feature_cols)]
model = svm(quality.factor~., data = wine_for_svm)
wine_for_svm$quality.prediction=fitted(model)
wine_by_quality_and_pred=prediction_test_and_group(wine_for_svm)
plot_predictions(wine_by_quality_and_pred,
"Predicted Quality (SVM) vs. Actual Quality")
No more “False” predictions, that’s a 100% accuracy, as good as it can get!
calculate_accuracy(wine_by_quality_and_pred, length(wine$quality.prediction))
## [1] 100
Since most of our data lies between qualities 5 and 6 as we’ve seen in the first section of this report, we have 1319 in these 2 qualities (82.5% of our data). We can’t really say that we can totally rely on our 100% accuracy, so I downloaded another csv file to run my model against it and see what we get.
file:
https://onlinecourses.science.psu.edu/stat857/sites/onlinecourses.science.psu.edu.stat857/files/Training50_winedata.csv
I manually deleted the last column, rt.sulfur.dioxide, since we don’t have it in our original data.
Here we go:
test_data=read.csv('training_wine.csv')
test_data$X=NULL
dim(test_data)
## [1] 2037 12
str(test_data)
## 'data.frame': 2037 obs. of 12 variables:
## $ fixed.acidity : num 7.5 6.3 7 7.4 8.1 6.6 7.3 6.9 8.5 7.2 ...
## $ volatile.acidity : num 0.33 0.27 0.3 0.38 0.12 0.2 0.26 0.32 0.18 0.27 ...
## $ citric.acid : num 0.32 0.29 0.51 0.27 0.38 0.38 0.36 0.17 0.3 0.28 ...
## $ residual.sugar : num 11.1 12.2 13.6 7.5 0.9 7.9 5.2 7.6 1.1 15.2 ...
## $ chlorides : num 0.036 0.044 0.05 0.041 0.034 0.052 0.04 0.042 0.028 0.046 ...
## $ free.sulfur.dioxide : num 25 59 40 24 36 30 31 69 34 6 ...
## $ total.sulfur.dioxide: num 119 196 168 160 86 145 141 219 95 41 ...
## $ density : num 0.996 0.998 0.998 0.995 0.99 ...
## $ pH : num 3.15 3.14 3.07 3.17 2.8 3.32 3.16 3.13 2.83 3.17 ...
## $ sulphates : num 0.34 0.4 0.52 0.43 0.55 0.56 0.59 0.4 0.36 0.39 ...
## $ alcohol : num 10.5 8.8 9.6 10 12 11 11 8.9 10 10.9 ...
## $ quality : int 6 6 7 5 6 7 6 5 4 6 ...
That’s a bit bigger data frame, than our actual data, but we have the same fields list in it as we can see above. Let’s go, let’s first take out all the fields that aren’t in our model, and create the ones that are, but not in the orginal data. Then we can test.
Let’s just take a quick look at the quality distribution of the dataset, and then go on with our testing.
ggplot(aes(x=quality), data=test_data)+
geom_histogram(binwidth=.1, fill='blue')+
scale_x_continuous(breaks=c(3:9))+
scale_y_continuous(breaks=seq(0, 950, 50))+
ggtitle('Test Data Wine Quality Counts')
## Warning: position_stack requires constant width: output may be incorrect
Well there isn’t a big difference in the distribution of this dataset, We have a bit less 3, 4 and 5 qualities. It has about 1.5 more 6 quality, about double quality 7, a bit more in 8 quality and a small number of 9’s.
test_data$quality.factor=factor(test_data$quality)
test_data=test_data[,(names(test_data) %in% feature_cols)]
test_data$quality.prediction=predict(model, test_data)
test_data_by_quality_and_pred=prediction_test_and_group(test_data)
plot_predictions(test_data_by_quality_and_pred,
"Predicted Quality (SVM) vs. Actual Quality")
Interesting! Looking at the plot we can also barely see the red points, we see more than the plot before, but it is still not a lot. Let’s see how much is our accuracy on this test data.
calculate_accuracy(test_data_by_quality_and_pred,
length(test_data$quality.prediction))
## [1] 89.34708
Here we go, a nice 89.35% accuracy, that’s lower than the 100% we got on our actual data, but that’s still high. Well I really didn’t expect that! ggplot(aes(x=quality.factor, y=alcohol, color=quality.factor), data=wine)+
geom_boxplot()+
scale_color_manual(values=rev(brewer.pal(9, "Blues")),
name="Quality (classifier)")+
ylab('Alcohol (% by volume)')+
xlab('Quality (Classifier)')+
ggtitle('Alcohol vs. Median Quality')
ggplot(aes(x=quality.factor, y=volatile.acidity, color=quality.factor),
data=wine)+
geom_boxplot()+
scale_color_manual(values=rev(brewer.pal(9, "Blues")),
name="Quality (classifier)")+
xlab('Quality (Classifier)')+
ylab('Volatile Acidity (g / dm^3)')+
ggtitle('Volatile acidity vs. Quality')
ggplot(aes(x=volatile.acidity, y=citric.acid,
color=quality.factor),
data=subset(wine, volatile.acidity>=.2 & volatile.acidity<=1.2))+
geom_point(stat='summary', fun.y=median)+
scale_x_continuous(breaks=seq(0.2, 1.2, .2))+
scale_size_discrete(guide = guide_legend(title = 'Quality (classifier)'))+
scale_color_manual(values=rev(brewer.pal(9, 'Blues')),
guide = guide_legend(title = 'Quality (classifier)'))+
ylab('Citric Acid (g / dm^3)')+
xlab('Volatile Acidity (g / dm^3)')+
ggtitle('Volatile Acidity vs. Citric Acid,
colored, sized by Quality')